Analysis of Cyber-security Threats

The purpose of this data analysis is to help model and predict cyber-security threats to help optimize global digital security. As well, to analyze which defense mechanisms that are optimal in combating cyber attacks. The dataset contains 10 variables that are potential significant covariates in predicting potential threats. The variable country refers to the country where the attack occurred. The other variables analyze incident resolution time in hours, number of affected users, financial loss, attack type, target industry, attack source, security vulnerability type and defense mechanism used. Here is a full list of variables:

  • Country country where the attack occurred
  • Year year of the incident
  • Attack Type type of cyber-security threat
  • Target Industry industry targeted
  • Financial Loss estimated financial loss in millions
  • Number of Affected Users number of users impacted by the attack
  • Attack Source origin of the attack
  • Security Vulnerability Type type of vulnerability exploited
  • Defense Mechanism Used cyber-security defense strategy applied
  • Incident Resolution Time time taken to fully resolve event in hours
cyber <- read.csv('https://raw.githubusercontent.com/GabbyK8/Datasets/refs/heads/main/Global_Cybersecurity_Threats_2015-2024.csv')
cyber$Defense.Mechanism.Used <- ifelse(cyber$Defense.Mechanism.Used == "AI-based Detection", "AI", cyber$Defense.Mechanism.Used)

# plot 1 number of affected users by financial loss by vulnerability type


ggplot(cyber, aes(x = cyber$Number.of.Affected.Users, y = cyber$Financial.Loss..in.Million..., color = cyber$Security.Vulnerability.Type)) +
geom_smooth(aes(x = cyber$Number.of.Affected.Users, y = cyber$Financial.Loss..in.Million..., color = cyber$Security.Vulnerability.Type), method = "lm", se = FALSE) +
  scale_size(range = c(2, 10)) +
  labs(
    title = "Plot of Security Vulnerability, Volume Compromised, and Financial Impact",
    x = "Volume Compromised",
    y = "Financial Impact",
    color = "Security Vulernability"
  ) +
  theme_minimal()
This plot shows the relationship between security vulnerability type and the interaction between the number of affected users and financial loss in millions. In most cases the number of affected users seems to be positively correlated with financial loss. The only exception is seen in zero-day. zero-day refers to a type of vulnerability where the target of the attack is unaware of the attack. Meaning there are zero days for the problem to be solved before it has been exploited. It would make sense that in this type of vulnerability, the fewer users affected would lead to a greater financial loss.

This plot shows the relationship between security vulnerability type and the interaction between the number of affected users and financial loss in millions. In most cases the number of affected users seems to be positively correlated with financial loss. The only exception is seen in zero-day. zero-day refers to a type of vulnerability where the target of the attack is unaware of the attack. Meaning there are zero days for the problem to be solved before it has been exploited. It would make sense that in this type of vulnerability, the fewer users affected would lead to a greater financial loss.

# plot 3 Density plot of attacked industries
Year_data <- cyber %>%
  filter(Year == 2015)
p1<- ggplot(Year_data, aes(x = Number.of.Affected.Users, fill = Country)) +
  geom_density(alpha = 0.6) +
  labs(
    x = "Number of Affected Users",
    y = "Density",
    fill="Country",
    title = "Density Plot of Attacked Countries in 2015"
  ) +
  theme_minimal()
ggplotly(p1)

This plot shows the density of number of affected users in cybersecurity attacks in each country in the year 2015.

 Year_data1 <- cyber %>%
  filter(Year == 2024)
# plot 4 Density plot of Attacked countries
p2 <- ggplot(Year_data1, aes(x = Number.of.Affected.Users, fill = Country)) +
  geom_density(alpha = 0.6) +
  labs(
    x = "Number of Affected Users",
    y = "Density",
    fill="Country",
    title = "Density Plot of Attacked Countries in 2024"
  ) +
  theme_minimal()
ggplotly(p2)

This plot shows the density of number of affected users in cybersecurity attacks in each country in the year 2024.

Each country is given a specific color and the number of affected users is displayed on the x-axis. The density is displayed on the y-axis. Density in this plot shows the relative frequency of number of affected users per attack in each country.In 2015, the cyber attacks within the USA more frequently affected about 310,000 users, compared to Russia where attacks more frequently attacked 670,000 users and Germany where most attacks more frequently affected 790,000 users. In 2024, the number of affected users in the USA does not seem to be much more dense in any one number and is more dispersed. Japan had more frequent attacks that affected 500,000 and Brazil had it’s most frequent at 750,000 users.

ggplot(cyber, aes(x = cyber$Year, y = cyber$Incident.Resolution.Time..in.Hours., color = cyber$Security.Vulnerability.Type)) +
geom_smooth(aes(x = cyber$Year, y = cyber$Incident.Resolution.Time..in.Hours., color = cyber$Security.Vulnerability.Type), method = "lm", se = FALSE) +
  scale_size(range = c(2, 10)) +
  labs(
    title = "Plot of Security Vulnerability, Year of Attack, and Resolution Time",
    x = "Year of Attack",
    y = "Incident Resolution Time in Hours",
    color = "Security Vulernability"
  ) +
  theme_minimal()
This plot shows the relationship between security vulnerability type and the interaction between the year of attack and resolution time. The plot shows how over the years resolution time has improved for certain vulnerability types, such as weak passwords, unpatched software, and social engineering. However, for zero-day there is an increase in resolution time, showing a positive relationship between zero-day and resolution time.

This plot shows the relationship between security vulnerability type and the interaction between the year of attack and resolution time. The plot shows how over the years resolution time has improved for certain vulnerability types, such as weak passwords, unpatched software, and social engineering. However, for zero-day there is an increase in resolution time, showing a positive relationship between zero-day and resolution time.

#plot of zero-day attack type by target industry
zero_data <- cyber %>%
  filter(Security.Vulnerability.Type == "Zero-day") 
a <- ggplot(zero_data, aes(x = zero_data$Attack.Type, fill =zero_data$Target.Industry )) +
  geom_bar(position = "dodge") +
  labs(
    x = "Threat Type",
    y = "Proportion",
    fill = "Target Industry",
    title = "Zero-day Incidences by Threat Type vs. Target Industry"
  ) +
  theme_minimal()
ggplotly(a)

This plot focuses on zero-day vulnerability and shows the proportion of threat types in each industry. The x-axis shows each threat type and the y-axis shows the proportion of attacks within the dataset that fall into each category. The different colors are meant to show a subcategory, target industry. This shows the proportion of the dataset that falls within each category of threat type in each target industry.

Zero-day attacks seem to commonly use Phising within both retail and IT industries, based on the height and count of the bar in these two categories. Ransomware is the most minimally used zero-day attack on the government, with a count of just 12.

col0=c("#4682B4", "#B4464B","#9900FF","#FF66FF", "#009933")
boxplot(Incident.Resolution.Time..in.Hours. ~ Defense.Mechanism.Used,
        data=cyber,
        xlab="Defense Mechanism",
        ylab="Resolution Time (hours)",
        col = col0,
        main="Incident Resolution Time by Defense Mechanism",
        cex.main = 0.9,
        col.main = "blue")
The bloxplot shows the relationship between defense mechanism and resolution time in hours.

The bloxplot shows the relationship between defense mechanism and resolution time in hours.

The boxplot shows that the average resolution time among attacks based on defense mechanisms is relatively the same. The only exception seems to be firewall, which appears to take slightly less resolution time on average.We can indicate the average by looking the black horizontal line within each box. They all appear to line up with one another indicating similar averages.

Skewness

cyber$Country[cyber$Country=="USA"]<-1
cyber$Country<- ifelse(cyber$Country!="USA",0)
cyber$Security.Vulnerability.Type[cyber$Security.Vulnerability.Type=="Zero-day"]<-1
cyber$Security.Vulnerability.Type<-ifelse(cyber$Security.Vulnerability.Type!="Zero-day",0)
cyber$Number.of.Affected.Users<-as.numeric(cyber$Number.of.Affected.Users)
cyber$Year<-as.numeric(as.factor(cyber$Year))
cyber$Incident.Resolution.Time..in.Hours.<-as.numeric(cyber$Incident.Resolution.Time..in.Hours.)


c<-skewness(cyber$Financial.Loss..in.Million...)
d<-skewness(cyber$Number.of.Affected.Users)
e<-skewness(cyber$Year) 

Skewness = cbind(Financial.Loss = c, 
                 Affected.Users =  d, 
                 Year = e)

pander(Skewness)
Financial.Loss Affected.Users Year
-0.01684 -0.02537 -0.02749

The data is normally skewed and no transformation is necessary.

Linear Regression

Linear regression is explored below:

# Predictors (x) and Response Variable (y)

set.seed(12345)
X <- as.data.frame(cyber[, -5])  

y <- cyber$Financial.Loss..in.Million...


train_index <- createDataPartition(y, p = 0.8, list = FALSE)
X_train <- makeX(X[train_index, ])
X_test <- makeX(X[-train_index, ])
y_train <- y[train_index]
y_test <- y[-train_index]

#standardizing the data
preprocess_params <- preProcess(X_train, method = c("center","scale"))
X_train <- predict(preprocess_params, X_train)
X_test <- predict(preprocess_params, X_test)


#fitting the model
fit_lasso <- glmnet(X_train, y_train, alpha = 1)  # Lasso
fit_ridge <- glmnet(X_train, y_train, alpha = 0)  # Ridge
fit_elastic_net <- glmnet(X_train, y_train, alpha = 0.5)  # Elastic Net
#cross-validation
cv_lasso <- cv.glmnet(X_train, y_train, alpha = 1)       # Lasso CV
cv_ridge <- cv.glmnet(X_train, y_train, alpha = 0)       # Ridge CV
cv_elastic_net <- cv.glmnet(X_train, y_train, alpha = 0.5)  # Elastic Net CV

#plot coefficient path
par(mar=c(5,4,6,3))
# Plot coefficient path
plot(fit_lasso, xvar = "lambda", label = TRUE,
     lwd = 1.5,
     main = "Coefficient Path Analysis: LASSO",
     cex.main = 0.9,
     col = rainbow(10))
abline(v = log(1), col = "purple", lty = 4, lwd = 2)
abline(v = -1, col = "steelblue", lty = 2, lwd = 2)
The x-axis of the coefficient path analysis shows the log of lambda. A low lambda has less regularization and is closer to ordinary least squares. A high lambda indicates higher regularization and at this lambda many coefficients will shrink to zero. The Y-axis shows the coefficient values for each variable predictor for financial loss. The lines shows the path of each coefficient has lambda differs. The dashed line shows the optimal level of shrinkage. Coefficients to the right of this line are more forced towards zero. At this chosen lambda only certain variables will be non-zero and are deemed significant.

The x-axis of the coefficient path analysis shows the log of lambda. A low lambda has less regularization and is closer to ordinary least squares. A high lambda indicates higher regularization and at this lambda many coefficients will shrink to zero. The Y-axis shows the coefficient values for each variable predictor for financial loss. The lines shows the path of each coefficient has lambda differs. The dashed line shows the optimal level of shrinkage. Coefficients to the right of this line are more forced towards zero. At this chosen lambda only certain variables will be non-zero and are deemed significant.

pander(data.frame(colnames(X_train)[c(27, 5, 20, 12)]),caption= "Feature Variables at log(Lambda) = 0")
Feature Variables at log(Lambda) = 0
colnames.X_train..c.27..5..20..12..
Incident.Resolution.Time..in.Hours.
Attack.TypeMan-in-the-Middle
Attack.SourceUnknown
Target.IndustryHealthcare
pander(data.frame(colnames(X_train)[c(12, 20, 5, 26, 14, 1, 35, 9, 39, 36, 4, 3, 21, 19, 2,6)]),caption= " Feature Variables at log(Lambda) = -1")
Feature Variables at log(Lambda) = -1
colnames.X_train..c.12..20..5..26..14..1..35..9..39..36..4..3..
Target.IndustryHealthcare
Attack.SourceUnknown
Attack.TypeMan-in-the-Middle
Defense.Mechanism.UsedVPN
Target.IndustryRetail
Country
NA
Target.IndustryBanking
NA
NA
Attack.TypeMalware
Attack.TypeDDoS
Security.Vulnerability.Type
Attack.SourceNation-state
Year
Attack.TypePhishing
par(mar=c(5,4,6,3))
##
plot(cv_lasso, main = "RMSE Plot: LASSO",
     cex.main = 0.9)
The plot of performance shows that as lambda increases, the MSE decreases. The two vertical lines give a reference of lambda.

The plot of performance shows that as lambda increases, the MSE decreases. The two vertical lines give a reference of lambda.

# Extract coefficients for the best lambda
best.lasso.lambda <- cv_lasso$lambda.min
best.ridge.lambda <- cv_ridge$lambda.min
best.elastic.net.lambda <- cv_elastic_net$lambda.min
##
# Lasso Regression (L1 Regularization): 
# CAUTION: model formula differs from the regular regression formula 
lasso_model.opt <- glmnet(X_train, 
                      y_train, 
                      alpha = 1,      # lasso regression 
                      lambda = best.lasso.lambda,standardize = TRUE)   # useser selected alpha, optimal lambda
                                    # can be obtained through CV (see below)
lasso_predictions.opt <- predict(lasso_model.opt, 
                             s = best.lasso.lambda, # user selected lambda value 
                                      # (regularization paremeter)
                             newx = X_test)  # test data set 
# The following RMSE of prediction serves as a validation - one step validation
lasso_rmse.opt <- sqrt(mean((y_test - lasso_predictions.opt)^2))   
                                          
# Ridge Regression (L2 Regularization)
ridge_model.opt <- glmnet(X_train, y_train, alpha = 0, lambda = best.ridge.lambda,standardize=TRUE)
ridge_predictions.opt <- predict(ridge_model.opt, s = best.ridge.lambda, newx = X_test)
ridge_rmse.opt <- sqrt(mean((y_test - ridge_predictions.opt)^2))

# Elastic Net (Combination of L1 and L2)
elastic_net_model.opt <- glmnet(X_train, y_train, alpha = 0.5, lambda = 0.1,standardize = TRUE)
elastic_net_predictions.opt <- predict(elastic_net_model.opt, s = 0.1, newx = X_test)
elastic_net_rmse.opt <- sqrt(mean((y_test - elastic_net_predictions.opt)^2))

RMSE.opt = cbind(LASSO.opt = lasso_rmse.opt, 
                 Ridge.opt =  ridge_rmse.opt, 
                 Elasticnet.opt = elastic_net_rmse.opt)

pander(RMSE.opt)
LASSO.opt Ridge.opt Elasticnet.opt
28.79 28.72 28.89

The optimal lambda for each type of linear regression is shown above.

LASSO Regression Equation

Now, a final model will be extracted using LASSO, Ridge, and Elastic Net.

#Lasso
# Extract coefficients with names from glmnet
coef_lasso <- coef(cv_lasso, s = best.lasso.lambda)
coef_df <- as.data.frame(as.matrix(coef_lasso))
coef_df$feature <- rownames(coef_df)
colnames(coef_df)[1] <- "coefficient"

# Filter non-zero coefficients
nonzero_coefs <- subset(coef_df, coefficient != 0)

# Separate intercept
intercept <- round(nonzero_coefs$coefficient[nonzero_coefs$feature == "(Intercept)"], 4)
nonzero_terms <- subset(nonzero_coefs, feature != "(Intercept)")

# Build model equation string
equation <- paste0(round(nonzero_terms$coefficient, 4), "*", nonzero_terms$feature)
cat("Model equation: y =", intercept, "+", paste(equation, collapse = " + "), "\n")
Model equation: y = 50.5369 + 0.6683*Attack.TypeDDoS + -0.4171*Target.IndustryEducation + 0.6626*Target.IndustryGovernment + -0.3464*Target.IndustryHealthcare + 0.265*Attack.SourceHacker Group + -0.5335*Attack.SourceInsider + 0.0065*Defense.Mechanism.UsedAntivirus 

Ridge Regression Equation

# Extract coefficients with names from glmnet
coef_ridge <- coef(cv_ridge, s = best.ridge.lambda)
coef_df <- as.data.frame(as.matrix(coef_ridge))
coef_df$feature <- rownames(coef_df)
colnames(coef_df)[1] <- "coefficient"

# Filter non-zero coefficients
nonzero_coefs <- subset(coef_df, coefficient != 0)

# Separate intercept
intercept <- round(nonzero_coefs$coefficient[nonzero_coefs$feature == "(Intercept)"], 4)
nonzero_terms <- subset(nonzero_coefs, feature != "(Intercept)")

# Build model equation string
equation <- paste0(round(nonzero_terms$coefficient, 4), "*", nonzero_terms$feature)
cat("Model equation: y =", intercept, "+", paste(equation, collapse = " + "), "\n")
Model equation: y = 50.5369 + 0.0114*Year + 0.0521*Attack.TypeDDoS + -0.0125*Attack.TypeMalware + 0.0145*Attack.TypeMan-in-the-Middle + -0.0171*Attack.TypePhishing + -0.0201*Attack.TypeRansomware + -0.0186*Attack.TypeSQL Injection + 7e-04*Target.IndustryBanking + -0.0445*Target.IndustryEducation + 0.0576*Target.IndustryGovernment + -0.0398*Target.IndustryHealthcare + 0.0036*Target.IndustryIT + 0.0055*Target.IndustryRetail + 0.0187*Target.IndustryTelecommunications + 6e-04*Number.of.Affected.Users + 0.0427*Attack.SourceHacker Group + -0.0496*Attack.SourceInsider + 0.0182*Attack.SourceNation-state + -0.0106*Attack.SourceUnknown + -0.0045*Defense.Mechanism.UsedAI + 0.0233*Defense.Mechanism.UsedAntivirus + -0.0193*Defense.Mechanism.UsedEncryption + 0.0027*Defense.Mechanism.UsedFirewall + -0.0029*Defense.Mechanism.UsedVPN + -0.0122*Incident.Resolution.Time..in.Hours. 

Elastic Net Regression Equation

# Extract coefficients with names from glmnet
coef_elastic <- coef(cv_elastic_net, s = best.elastic.net.lambda)
coef_df <- as.data.frame(as.matrix(coef_elastic))
coef_df$feature <- rownames(coef_df)
colnames(coef_df)[1] <- "coefficient"

# Filter non-zero coefficients
nonzero_coefs <- subset(coef_df, coefficient != 0)

# Separate intercept
intercept <- round(nonzero_coefs$coefficient[nonzero_coefs$feature == "(Intercept)"], 4)
nonzero_terms <- subset(nonzero_coefs, feature != "(Intercept)")

# Build model equation string
equation <- paste0(round(nonzero_terms$coefficient, 4), "*", nonzero_terms$feature)
cat("Model equation: y =", intercept, "+", paste(equation, collapse = " + "), "\n")
Model equation: y = 50.5369 + 0.4941*Attack.TypeDDoS + -0.2438*Target.IndustryEducation + 0.5433*Target.IndustryGovernment + -0.168*Target.IndustryHealthcare + 0.1465*Attack.SourceHacker Group + -0.3984*Attack.SourceInsider 

Comparing Linear Regression with SVR

cyber <- read.csv('https://raw.githubusercontent.com/GabbyK8/Datasets/refs/heads/main/Global_Cybersecurity_Threats_2015-2024.csv')

#One-Hot Encoding
dummy <- dummyVars("~.", data=cyber)
cyber2<-data.frame(predict(dummy, newdata=cyber))

#####
# Split data into features (X) and target (y)
X <- cyber2[, -25]  # All columns except the target variable
y <- cyber2[, 25]   # Target variable (Financial Loss in Millions)

# create dummy variables



#####
# Split data into training and testing sets
set.seed(123)
train.index <- sample(1:nrow(cyber2), 0.8 * nrow(cyber2))
X.train <- X[train.index, ]
y.train <- y[train.index]
X.test <- X[-train.index, ]
y.test <- y[-train.index]

#####
cyber2.train <- as.data.frame(cyber2[train.index, ])
cyber2.test <- cyber2[-train.index, ]
#####
## Set up custom cross-validation control
my_tune_control <- tune.control(
  cross = 5,  # Use 5-fold cross-validation, the default is 10-fold cross-validation
  nrepeat = 1 # Number of repetitions (for repeated cross-validation)
)
#####
# Perform grid search for hyperparameter tuning: RBF kernel is used
tune.RBF <- tune(svm,
                 train.x = X.train, 
                 train.y = y.train, 
                    ranges = list(epsilon = seq(0.1, 0.5, 0.1), 
                                  cost = c(1, 10, 100), 
                                  gamma = c(0.01, 0.1, 1)), # Hyperpar in RBF
                    tunecontrol = my_tune_control    # 5-fold cross-validation

                                    )

####
# Display the best parameters
#print(tune.results$best.parameters)
#####
# Train the final model using the best parameters
final.RBF<- svm(X.train, y.train, 
                   type = "eps-regression",  # Use "nu-regression" for nu-SVR
                   kernel = "radial", 
                   epsilon = tune.RBF$best.parameters$epsilon, 
                   cost = tune.RBF$best.parameters$cost, 
                   gamma = tune.RBF$best.parameters$gamma)
#####
# Make predictions on the test set
pred.RBF <- predict(final.RBF, X.test)

# Evaluate performance
mse.RBF <- mean((y.test - pred.RBF)^2)    # mean square error
mae.RBF <- mean(abs(y.test - pred.RBF))   # mean absolute error

#### linear kernel

# Perform grid search for hyperparameter tuning: RBF kernel is used
tune.lin <- tune(svm, train.x = X.train, train.y = y.train, 
                    ranges = list(epsilon = seq(0.1, 0.5, 0.1), 
                                  cost = c(1, 10, 100)), 
                    tunecontrol = my_tune_control # 5-fold cross-validation
                    )
####
# Display the best parameters
#print(tune.lin$best.parameters)
#####
# Train the final model using the best parameters
final.lin<- svm(X.train, y.train, 
                   type = "eps-regression",  # Use "nu-regression" for nu-SVR
                   kernel = "linear", 
                   epsilon = tune.lin$best.parameters$epsilon, 
                   cost = tune.lin$best.parameters$cost)
#####
# Make predictions on the test set
pred.lin <- predict(final.lin, X.test)

# Evaluate performance
mse.lin <- mean((y.test - pred.lin)^2)    # mean square error
mae.lin <- mean(abs(y.test - pred.lin))

# After tuning is complete
## ordinary LSE regression model with stepwise variable selection
lse.fit <- lm(Financial.Loss..in.Million...~.,data=cyber2.train)
AIC.fit <- stepAIC(lse.fit,direction="both", trace = FALSE)
pred.lse <- predict(AIC.fit, X.test)
mse.lse <- mean((y.test - pred.lse)^2)    # mean square error
mae.lse <- mean(abs(y.test - pred.lse))   # mean absolute error
###
par(mfrow=c(2,2), mar=c(2,2,2,2))
plot(AIC.fit)
The plots are a diagnostic plot for the regression model.

The plots are a diagnostic plot for the regression model.

###
Performance <- data.frame(RBF.SVR=c(mse.RBF, mae.RBF),
                          Linear.SVR = c(mse.lin, mae.lin),
                          LSE.Reg =c(mse.lse, mae.lse))
row.names(Performance) <- c("MSE", "MAE")
##
pander(Performance)
  RBF.SVR Linear.SVR LSE.Reg
MSE 811.8 829.2 810.4
MAE 24.54 24.75 24.45

Based on our MSE and MAE scores, our least squares estimation has the best model. It has lower MSE and MAE scores indicating a better fit model.

---
title: "Project 2 Part 1"
author: "Gabriella Khalil"
date: "2025-03-26"
output:
  html_document: 
    toc: yes
    toc_depth: 4
    toc_float: yes
    number_sections: no
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
  word_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    keep_md: yes
  pdf_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    number_sections: no
    fig_width: 3
    fig_height: 3
editor_options: 
  chunk_output_type: inline
---

```{=html}

<style type="text/css">

/* Cascading Style Sheets (CSS) is a stylesheet language used to describe the presentation of a document written in HTML or XML. it is a simple mechanism for adding style (e.g., fonts, colors, spacing) to Web documents. */

h1.title {  /* Title - font specifications of the report title */
  font-size: 22px;
  font-weight: bold;
  color: DarkRed;
  text-align: center;
  font-family: "Gill Sans", sans-serif;
}
h4.author { /* Header 4 - font specifications for authors  */
  font-size: 18px;
  font-weight: bold;
  font-family: system-ui;
  color: navy;
  text-align: center;
}
h4.date { /* Header 4 - font specifications for the date  */
  font-size: 18px;
  font-family: system-ui;
  color: DarkBlue;
  text-align: center;
  font-weight: bold;
}
h1 { /* Header 1 - font specifications for level 1 section title  */
    font-size: 22px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: center;
    font-weight: bold;
}
h2 { /* Header 2 - font specifications for level 2 section title */
    font-size: 20px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
    font-weight: bold;
}

h3 { /* Header 3 - font specifications of level 3 section title  */
    font-size: 18px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - font specifications of level 4 section title  */
    font-size: 18px;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

body { background-color:white; }

.highlightme { background-color:yellow; }

p { background-color:white; }

</style>
```

```{r setup, include=FALSE}
# code chunk specifies whether the R code, warnings, and output 
# will be included in the output files.
if (!require("knitr")) {
   install.packages("knitr")
   library(knitr)
}
if (!require("tidyverse")) {
   install.packages("tidyverse")
library(tidyverse)
}
if (!require("GGally")) {
   install.packages("GGally")
library(GGally)
}

library(ggplot2)
library(plotly)
library(cowplot)
library(pander)
library(moments)
library(caret)
library(glmnet)
# General
library(e1071)        # For SVM       # For Ridge, Lasso, logistic regression
library(pROC)         # For ROC curves
library(Metrics)      # For MSE/MAE
library(MASS)
# Optional visualization
library(gridExtra)
knitr::opts_chunk$set(echo = TRUE,   # include code chunk in the output file
                      warning = FALSE,# sometimes, you code may produce warning messages,
                                      # you can choose to include the warning messages in
                                      # the output file. 
                      results = TRUE, # you can also decide whether to include the output
                                      # in the output file.
                      message = FALSE,
                      comment = NA
                      )  

```

# Analysis of Cyber-security Threats

The purpose of this data analysis is to help model and predict cyber-security threats to help optimize global digital security. As well, to analyze which defense mechanisms that are optimal in combating cyber attacks. The dataset contains 10 variables that are potential significant covariates in predicting potential threats. The variable country refers to the country where the attack occurred. The other variables analyze incident resolution time in hours, number of affected users, financial loss, attack type, target industry, attack source, security vulnerability type and defense mechanism used. Here is a full list of variables:

-   **Country** country where the attack occurred
-   **Year** year of the incident
-   **Attack Type** type of cyber-security threat
-   **Target Industry** industry targeted
-   **Financial Loss** estimated financial loss in millions
-   **Number of Affected Users** number of users impacted by the attack
-   **Attack Source** origin of the attack
-   **Security Vulnerability Type** type of vulnerability exploited
-   **Defense Mechanism Used** cyber-security defense strategy applied
-   **Incident Resolution Time** time taken to fully resolve event in hours

```{r fig.align='center', fig.width=7, fig.height=5, fig.cap='This plot shows the relationship between security vulnerability type and the interaction between the number of affected users and financial loss in millions. In most cases the number of affected users seems to be positively correlated with financial loss. The only exception is seen in zero-day. zero-day refers to a type of vulnerability where the target of the attack is unaware of the attack. Meaning there are zero days for the problem to be solved before it has been exploited. It would make sense that in this type of vulnerability, the fewer users affected would lead to a greater financial loss.'}
cyber <- read.csv('https://raw.githubusercontent.com/GabbyK8/Datasets/refs/heads/main/Global_Cybersecurity_Threats_2015-2024.csv')
cyber$Defense.Mechanism.Used <- ifelse(cyber$Defense.Mechanism.Used == "AI-based Detection", "AI", cyber$Defense.Mechanism.Used)

# plot 1 number of affected users by financial loss by vulnerability type


ggplot(cyber, aes(x = cyber$Number.of.Affected.Users, y = cyber$Financial.Loss..in.Million..., color = cyber$Security.Vulnerability.Type)) +
geom_smooth(aes(x = cyber$Number.of.Affected.Users, y = cyber$Financial.Loss..in.Million..., color = cyber$Security.Vulnerability.Type), method = "lm", se = FALSE) +
  scale_size(range = c(2, 10)) +
  labs(
    title = "Plot of Security Vulnerability, Volume Compromised, and Financial Impact",
    x = "Volume Compromised",
    y = "Financial Impact",
    color = "Security Vulernability"
  ) +
  theme_minimal()
```

```{r fig.align='center', fig.width=7, fig.height=5, fig.cap='This plot shows the density of number of affected users in cybersecurity attacks in each country in the year 2015. '}
# plot 3 Density plot of attacked industries
Year_data <- cyber %>%
  filter(Year == 2015)
p1<- ggplot(Year_data, aes(x = Number.of.Affected.Users, fill = Country)) +
  geom_density(alpha = 0.6) +
  labs(
    x = "Number of Affected Users",
    y = "Density",
    fill="Country",
    title = "Density Plot of Attacked Countries in 2015"
  ) +
  theme_minimal()
ggplotly(p1)
```

```{r fig.align='center', fig.width=7, fig.height=5, fig.cap="This plot shows the density of number of affected users in cybersecurity attacks in each country in the year 2024. "}
 Year_data1 <- cyber %>%
  filter(Year == 2024)
# plot 4 Density plot of Attacked countries
p2 <- ggplot(Year_data1, aes(x = Number.of.Affected.Users, fill = Country)) +
  geom_density(alpha = 0.6) +
  labs(
    x = "Number of Affected Users",
    y = "Density",
    fill="Country",
    title = "Density Plot of Attacked Countries in 2024"
  ) +
  theme_minimal()
ggplotly(p2)

```

Each country is given a specific color and the number of affected users is displayed on the x-axis. The density is displayed on the y-axis. Density in this plot shows the relative frequency of number of affected users per attack in each country.In 2015, the cyber attacks within the USA more frequently affected about 310,000 users, compared to Russia where attacks more frequently attacked 670,000 users and Germany where most attacks more frequently affected 790,000 users. In 2024, the number of affected users in the USA does not seem to be much more dense in any one number and is more dispersed. Japan had more frequent attacks that affected 500,000 and Brazil had it's most frequent at 750,000 users.

```{r fig.align='center', fig.width=7, fig.height=5, fig.cap='This plot shows the relationship between security vulnerability type and the interaction between the year of attack and resolution time. The plot shows how over the years resolution time has improved for certain vulnerability types, such as weak passwords, unpatched software, and social engineering. However, for zero-day there is an increase in resolution time, showing a positive relationship between zero-day and resolution time. '}

ggplot(cyber, aes(x = cyber$Year, y = cyber$Incident.Resolution.Time..in.Hours., color = cyber$Security.Vulnerability.Type)) +
geom_smooth(aes(x = cyber$Year, y = cyber$Incident.Resolution.Time..in.Hours., color = cyber$Security.Vulnerability.Type), method = "lm", se = FALSE) +
  scale_size(range = c(2, 10)) +
  labs(
    title = "Plot of Security Vulnerability, Year of Attack, and Resolution Time",
    x = "Year of Attack",
    y = "Incident Resolution Time in Hours",
    color = "Security Vulernability"
  ) +
  theme_minimal()

```

```{r fig.align='center', fig.width=7, fig.height=5, fig.cap='This plot focuses on zero-day vulnerability and shows the proportion of threat types in each industry. The x-axis shows each threat type and the y-axis shows the proportion of attacks within the dataset that fall into each category. The different colors are meant to show a subcategory, target industry. This shows the proportion of the dataset that falls within each category of threat type in each target industry.'}
#plot of zero-day attack type by target industry
zero_data <- cyber %>%
  filter(Security.Vulnerability.Type == "Zero-day") 
a <- ggplot(zero_data, aes(x = zero_data$Attack.Type, fill =zero_data$Target.Industry )) +
  geom_bar(position = "dodge") +
  labs(
    x = "Threat Type",
    y = "Proportion",
    fill = "Target Industry",
    title = "Zero-day Incidences by Threat Type vs. Target Industry"
  ) +
  theme_minimal()
ggplotly(a)
```

Zero-day attacks seem to commonly use Phising within both retail and IT industries, based on the height and count of the bar in these two categories. Ransomware is the most minimally used zero-day attack on the government, with a count of just 12.

```{r fig.align='center', fig.width=7, fig.height=5, fig.cap='The bloxplot shows the relationship between defense mechanism and resolution time in hours.'}

col0=c("#4682B4", "#B4464B","#9900FF","#FF66FF", "#009933")
boxplot(Incident.Resolution.Time..in.Hours. ~ Defense.Mechanism.Used,
        data=cyber,
        xlab="Defense Mechanism",
        ylab="Resolution Time (hours)",
        col = col0,
        main="Incident Resolution Time by Defense Mechanism",
        cex.main = 0.9,
        col.main = "blue")


```

The boxplot shows that the average resolution time among attacks based on defense mechanisms is relatively the same. The only exception seems to be firewall, which appears to take slightly less resolution time on average.We can indicate the average by looking the black horizontal line within each box. They all appear to line up with one another indicating similar averages.

# Skewness

```{r}

cyber$Country[cyber$Country=="USA"]<-1
cyber$Country<- ifelse(cyber$Country!="USA",0)
cyber$Security.Vulnerability.Type[cyber$Security.Vulnerability.Type=="Zero-day"]<-1
cyber$Security.Vulnerability.Type<-ifelse(cyber$Security.Vulnerability.Type!="Zero-day",0)
cyber$Number.of.Affected.Users<-as.numeric(cyber$Number.of.Affected.Users)
cyber$Year<-as.numeric(as.factor(cyber$Year))
cyber$Incident.Resolution.Time..in.Hours.<-as.numeric(cyber$Incident.Resolution.Time..in.Hours.)


c<-skewness(cyber$Financial.Loss..in.Million...)
d<-skewness(cyber$Number.of.Affected.Users)
e<-skewness(cyber$Year) 

Skewness = cbind(Financial.Loss = c, 
                 Affected.Users =  d, 
                 Year = e)

pander(Skewness)
```

The data is normally skewed and no transformation is necessary.

# Linear Regression

Linear regression is explored below:

```{r fig.align='center', fig.width=10, fig.height=12, fig.cap='The x-axis of the coefficient path analysis shows the log of lambda. A low lambda has less regularization and is closer to ordinary least squares. A high lambda indicates higher regularization and at this lambda many coefficients will shrink to zero. The Y-axis shows the coefficient values for each variable predictor for financial loss. The lines shows the path of each coefficient has lambda differs. The dashed line shows the optimal level of shrinkage. Coefficients to the right of this line are more forced towards zero. At this chosen lambda only certain variables will be non-zero and are deemed significant. '}




# Predictors (x) and Response Variable (y)

set.seed(12345)
X <- as.data.frame(cyber[, -5])  

y <- cyber$Financial.Loss..in.Million...


train_index <- createDataPartition(y, p = 0.8, list = FALSE)
X_train <- makeX(X[train_index, ])
X_test <- makeX(X[-train_index, ])
y_train <- y[train_index]
y_test <- y[-train_index]

#standardizing the data
preprocess_params <- preProcess(X_train, method = c("center","scale"))
X_train <- predict(preprocess_params, X_train)
X_test <- predict(preprocess_params, X_test)


#fitting the model
fit_lasso <- glmnet(X_train, y_train, alpha = 1)  # Lasso
fit_ridge <- glmnet(X_train, y_train, alpha = 0)  # Ridge
fit_elastic_net <- glmnet(X_train, y_train, alpha = 0.5)  # Elastic Net
#cross-validation
cv_lasso <- cv.glmnet(X_train, y_train, alpha = 1)       # Lasso CV
cv_ridge <- cv.glmnet(X_train, y_train, alpha = 0)       # Ridge CV
cv_elastic_net <- cv.glmnet(X_train, y_train, alpha = 0.5)  # Elastic Net CV

#plot coefficient path
par(mar=c(5,4,6,3))
# Plot coefficient path
plot(fit_lasso, xvar = "lambda", label = TRUE,
     lwd = 1.5,
     main = "Coefficient Path Analysis: LASSO",
     cex.main = 0.9,
     col = rainbow(10))
abline(v = log(1), col = "purple", lty = 4, lwd = 2)
abline(v = -1, col = "steelblue", lty = 2, lwd = 2)

```

```{r}

pander(data.frame(colnames(X_train)[c(27, 5, 20, 12)]),caption= "Feature Variables at log(Lambda) = 0")

pander(data.frame(colnames(X_train)[c(12, 20, 5, 26, 14, 1, 35, 9, 39, 36, 4, 3, 21, 19, 2,6)]),caption= " Feature Variables at log(Lambda) = -1")
```

```{r fig.align='center', fig.width=7, fig.height=5, fig.cap='The plot of performance shows that as lambda increases, the MSE decreases. The two vertical lines give a reference of lambda. '}
par(mar=c(5,4,6,3))
##
plot(cv_lasso, main = "RMSE Plot: LASSO",
     cex.main = 0.9)
```

```{r}
# Extract coefficients for the best lambda
best.lasso.lambda <- cv_lasso$lambda.min
best.ridge.lambda <- cv_ridge$lambda.min
best.elastic.net.lambda <- cv_elastic_net$lambda.min
##
# Lasso Regression (L1 Regularization): 
# CAUTION: model formula differs from the regular regression formula 
lasso_model.opt <- glmnet(X_train, 
                      y_train, 
                      alpha = 1,      # lasso regression 
                      lambda = best.lasso.lambda,standardize = TRUE)   # useser selected alpha, optimal lambda
                                    # can be obtained through CV (see below)
lasso_predictions.opt <- predict(lasso_model.opt, 
                             s = best.lasso.lambda, # user selected lambda value 
                                      # (regularization paremeter)
                             newx = X_test)  # test data set 
# The following RMSE of prediction serves as a validation - one step validation
lasso_rmse.opt <- sqrt(mean((y_test - lasso_predictions.opt)^2))   
                                          
# Ridge Regression (L2 Regularization)
ridge_model.opt <- glmnet(X_train, y_train, alpha = 0, lambda = best.ridge.lambda,standardize=TRUE)
ridge_predictions.opt <- predict(ridge_model.opt, s = best.ridge.lambda, newx = X_test)
ridge_rmse.opt <- sqrt(mean((y_test - ridge_predictions.opt)^2))

# Elastic Net (Combination of L1 and L2)
elastic_net_model.opt <- glmnet(X_train, y_train, alpha = 0.5, lambda = 0.1,standardize = TRUE)
elastic_net_predictions.opt <- predict(elastic_net_model.opt, s = 0.1, newx = X_test)
elastic_net_rmse.opt <- sqrt(mean((y_test - elastic_net_predictions.opt)^2))

RMSE.opt = cbind(LASSO.opt = lasso_rmse.opt, 
                 Ridge.opt =  ridge_rmse.opt, 
                 Elasticnet.opt = elastic_net_rmse.opt)

pander(RMSE.opt)
```

The optimal lambda for each type of linear regression is shown above.

## LASSO Regression Equation

Now, a final model will be extracted using LASSO, Ridge, and Elastic Net.

```{r}
#Lasso
# Extract coefficients with names from glmnet
coef_lasso <- coef(cv_lasso, s = best.lasso.lambda)
coef_df <- as.data.frame(as.matrix(coef_lasso))
coef_df$feature <- rownames(coef_df)
colnames(coef_df)[1] <- "coefficient"

# Filter non-zero coefficients
nonzero_coefs <- subset(coef_df, coefficient != 0)

# Separate intercept
intercept <- round(nonzero_coefs$coefficient[nonzero_coefs$feature == "(Intercept)"], 4)
nonzero_terms <- subset(nonzero_coefs, feature != "(Intercept)")

# Build model equation string
equation <- paste0(round(nonzero_terms$coefficient, 4), "*", nonzero_terms$feature)
cat("Model equation: y =", intercept, "+", paste(equation, collapse = " + "), "\n")


```

## Ridge Regression Equation

```{r}

# Extract coefficients with names from glmnet
coef_ridge <- coef(cv_ridge, s = best.ridge.lambda)
coef_df <- as.data.frame(as.matrix(coef_ridge))
coef_df$feature <- rownames(coef_df)
colnames(coef_df)[1] <- "coefficient"

# Filter non-zero coefficients
nonzero_coefs <- subset(coef_df, coefficient != 0)

# Separate intercept
intercept <- round(nonzero_coefs$coefficient[nonzero_coefs$feature == "(Intercept)"], 4)
nonzero_terms <- subset(nonzero_coefs, feature != "(Intercept)")

# Build model equation string
equation <- paste0(round(nonzero_terms$coefficient, 4), "*", nonzero_terms$feature)
cat("Model equation: y =", intercept, "+", paste(equation, collapse = " + "), "\n")



```

## Elastic Net Regression Equation

```{r}

# Extract coefficients with names from glmnet
coef_elastic <- coef(cv_elastic_net, s = best.elastic.net.lambda)
coef_df <- as.data.frame(as.matrix(coef_elastic))
coef_df$feature <- rownames(coef_df)
colnames(coef_df)[1] <- "coefficient"

# Filter non-zero coefficients
nonzero_coefs <- subset(coef_df, coefficient != 0)

# Separate intercept
intercept <- round(nonzero_coefs$coefficient[nonzero_coefs$feature == "(Intercept)"], 4)
nonzero_terms <- subset(nonzero_coefs, feature != "(Intercept)")

# Build model equation string
equation <- paste0(round(nonzero_terms$coefficient, 4), "*", nonzero_terms$feature)
cat("Model equation: y =", intercept, "+", paste(equation, collapse = " + "), "\n")



```

# Comparing Linear Regression with SVR

```{r}

cyber <- read.csv('https://raw.githubusercontent.com/GabbyK8/Datasets/refs/heads/main/Global_Cybersecurity_Threats_2015-2024.csv')

#One-Hot Encoding
dummy <- dummyVars("~.", data=cyber)
cyber2<-data.frame(predict(dummy, newdata=cyber))

#####
# Split data into features (X) and target (y)
X <- cyber2[, -25]  # All columns except the target variable
y <- cyber2[, 25]   # Target variable (Financial Loss in Millions)

# create dummy variables



#####
# Split data into training and testing sets
set.seed(123)
train.index <- sample(1:nrow(cyber2), 0.8 * nrow(cyber2))
X.train <- X[train.index, ]
y.train <- y[train.index]
X.test <- X[-train.index, ]
y.test <- y[-train.index]

#####
cyber2.train <- as.data.frame(cyber2[train.index, ])
cyber2.test <- cyber2[-train.index, ]
#####
## Set up custom cross-validation control
my_tune_control <- tune.control(
  cross = 5,  # Use 5-fold cross-validation, the default is 10-fold cross-validation
  nrepeat = 1 # Number of repetitions (for repeated cross-validation)
)
#####
# Perform grid search for hyperparameter tuning: RBF kernel is used
tune.RBF <- tune(svm,
                 train.x = X.train, 
                 train.y = y.train, 
                    ranges = list(epsilon = seq(0.1, 0.5, 0.1), 
                                  cost = c(1, 10, 100), 
                                  gamma = c(0.01, 0.1, 1)), # Hyperpar in RBF
                    tunecontrol = my_tune_control    # 5-fold cross-validation

                                    )

####
# Display the best parameters
#print(tune.results$best.parameters)
#####
# Train the final model using the best parameters
final.RBF<- svm(X.train, y.train, 
                   type = "eps-regression",  # Use "nu-regression" for nu-SVR
                   kernel = "radial", 
                   epsilon = tune.RBF$best.parameters$epsilon, 
                   cost = tune.RBF$best.parameters$cost, 
                   gamma = tune.RBF$best.parameters$gamma)
#####
# Make predictions on the test set
pred.RBF <- predict(final.RBF, X.test)

# Evaluate performance
mse.RBF <- mean((y.test - pred.RBF)^2)    # mean square error
mae.RBF <- mean(abs(y.test - pred.RBF))   # mean absolute error

#### linear kernel

# Perform grid search for hyperparameter tuning: RBF kernel is used
tune.lin <- tune(svm, train.x = X.train, train.y = y.train, 
                    ranges = list(epsilon = seq(0.1, 0.5, 0.1), 
                                  cost = c(1, 10, 100)), 
                    tunecontrol = my_tune_control # 5-fold cross-validation
                    )
####
# Display the best parameters
#print(tune.lin$best.parameters)
#####
# Train the final model using the best parameters
final.lin<- svm(X.train, y.train, 
                   type = "eps-regression",  # Use "nu-regression" for nu-SVR
                   kernel = "linear", 
                   epsilon = tune.lin$best.parameters$epsilon, 
                   cost = tune.lin$best.parameters$cost)
#####
# Make predictions on the test set
pred.lin <- predict(final.lin, X.test)

# Evaluate performance
mse.lin <- mean((y.test - pred.lin)^2)    # mean square error
mae.lin <- mean(abs(y.test - pred.lin))

# After tuning is complete



```

```{r fig.align='center', fig.width=7, fig.height=5, fig.cap='The plots are a diagnostic plot for the regression model.'}

## ordinary LSE regression model with stepwise variable selection
lse.fit <- lm(Financial.Loss..in.Million...~.,data=cyber2.train)
AIC.fit <- stepAIC(lse.fit,direction="both", trace = FALSE)
pred.lse <- predict(AIC.fit, X.test)
mse.lse <- mean((y.test - pred.lse)^2)    # mean square error
mae.lse <- mean(abs(y.test - pred.lse))   # mean absolute error
###
par(mfrow=c(2,2), mar=c(2,2,2,2))
plot(AIC.fit)


```

```{r}
###
Performance <- data.frame(RBF.SVR=c(mse.RBF, mae.RBF),
                          Linear.SVR = c(mse.lin, mae.lin),
                          LSE.Reg =c(mse.lse, mae.lse))
row.names(Performance) <- c("MSE", "MAE")
##
pander(Performance)
```

Based on our MSE and MAE scores, our least squares estimation has the best model. It has lower MSE and MAE scores indicating a better fit model.
